[FLINK-34995] flink kafka connector source stuck when partition leade… #91

yanspirit · 2024-04-07T07:50:15Z

When a partition's leader is invalid (leader=-1), a Flink streaming job using KafkaSource cannot restart, nor can a new instance start with a new group ID. The job gets stuck and throws the following exception:

"org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before the position for partition aaa-1 could be determined"

When leader=-1, Kafka APIs such as KafkaConsumer.position() block until either the position can be determined or an unrecoverable error is encountered; here the call ends with the 60000 ms timeout shown above.

In fact, leader=-1 is not easy to avoid. Even with a replication factor of 3, three disks going offline together will trigger the problem, especially when the cluster is relatively large. Recovery relies on the Kafka administrator fixing it in time, which is risky during the Kafka cluster's peak period.

This can be addressed by filtering out partitions with an invalid leader and relying on the partition discovery interval to pick them up again later.
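To make the filtering idea concrete, here is a minimal, dependency-free sketch. It is not the code in this PR: `PartitionMeta`, `NO_LEADER`, and `withValidLeader` are hypothetical names introduced only for illustration. In a real job the leader information would come from Kafka metadata (for example `TopicPartitionInfo.leader()` in kafka-clients) rather than from a hand-built list.

```java
import java.util.ArrayList;
import java.util.List;

public class LeaderFilterSketch {
    // Assumption: a leader id of -1 marks a partition without a valid leader,
    // matching the "leader=-1" state described above.
    static final int NO_LEADER = -1;

    /** Hypothetical stand-in for the partition metadata returned by discovery. */
    static final class PartitionMeta {
        final String topic;
        final int partition;
        final int leaderId;

        PartitionMeta(String topic, int partition, int leaderId) {
            this.topic = topic;
            this.partition = partition;
            this.leaderId = leaderId;
        }

        @Override
        public String toString() {
            return topic + "-" + partition;
        }
    }

    /** Keeps only partitions that currently have a leader; the rest are skipped for now. */
    static List<PartitionMeta> withValidLeader(List<PartitionMeta> discovered) {
        List<PartitionMeta> result = new ArrayList<>();
        for (PartitionMeta p : discovered) {
            if (p.leaderId != NO_LEADER) {
                result.add(p);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<PartitionMeta> discovered = new ArrayList<>();
        discovered.add(new PartitionMeta("aaa", 0, 1));
        discovered.add(new PartitionMeta("aaa", 1, NO_LEADER)); // the stuck partition
        discovered.add(new PartitionMeta("aaa", 2, 3));

        // Only aaa-0 and aaa-2 would be handed to readers in this round.
        System.out.println(withValidLeader(discovered)); // prints [aaa-0, aaa-2]
    }
}
```

Skipping the leaderless partition means no blocking position lookup is attempted for it, so the remaining partitions can start normally.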

…r invalid

boring-cyborg · 2024-04-07T07:50:35Z

Thanks for opening this pull request! Please check out our contributing guidelines. (https://flink.apache.org/contributing/how-to-contribute.html)

MartijnVisser

Is it possible to add a test for this situation?

yanspirit · 2024-04-12T08:59:25Z

> Is it possible to add a test for this situation?

This test is a bit tricky because it requires simulating a broker crash. I have tested the change locally, and it ignores partitions where leader=-1. Once the leader recovers, these partitions are detected by partition discovery and added for processing. Should I add a configuration switch for this optimization?
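The behaviour described in this comment (skip while leader=-1, pick up after recovery) can be sketched without a broker as a small simulation of repeated discovery rounds. This is only an illustration under stated assumptions, not the PR's test or implementation: `DiscoveryRoundSketch`, `discover`, and the map of partition name to leader id are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DiscoveryRoundSketch {
    // Assumption: -1 marks a partition without a valid leader.
    static final int NO_LEADER = -1;

    // Partitions already handed out in earlier rounds.
    private final Set<String> assigned = new LinkedHashSet<>();

    /**
     * One simulated discovery round.
     * Returns the partitions that are newly assigned in this round:
     * those with a valid leader that were not assigned before.
     */
    List<String> discover(Map<String, Integer> leaderByPartition) {
        List<String> added = new ArrayList<>();
        for (Map.Entry<String, Integer> e : leaderByPartition.entrySet()) {
            boolean hasLeader = e.getValue() != NO_LEADER;
            if (hasLeader && assigned.add(e.getKey())) {
                added.add(e.getKey());
            }
        }
        return added;
    }

    public static void main(String[] args) {
        DiscoveryRoundSketch discovery = new DiscoveryRoundSketch();
        Map<String, Integer> metadata = new LinkedHashMap<>();
        metadata.put("aaa-0", 1);
        metadata.put("aaa-1", NO_LEADER); // leader lost, e.g. after a broker crash

        System.out.println(discovery.discover(metadata)); // prints [aaa-0]

        metadata.put("aaa-1", 2); // leader recovers before the next round
        System.out.println(discovery.discover(metadata)); // prints [aaa-1]
    }
}
```

In the real connector the rounds would be driven by the configured partition discovery interval; if periodic discovery is disabled, a skipped partition would presumably not be picked up until the job restarts.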

[FLINK-34995] flink kafka connector source stuck when partition leade…

b339f0e

…r invalid

boring-cyborg bot added the component=Connectors/Kafka label Apr 7, 2024

MartijnVisser reviewed Apr 11, 2024
